{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>General Instructions</h1>\n",
    "<ul>\n",
    "<li>You need to change the country code for Romania from ROU to ROM in order to match</li>\n",
    "<li>For World Bank data, you need to remove the first two columns from the CSV file</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as npt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Reading the full country features file</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/country_info.csv')\n",
    "meta = ['Country Code']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding the population data as taken from World Bank</h2>\n",
    "<p> Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship--except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin. The values shown are midyear estimates.\n",
    "http://data.worldbank.org/indicator/SP.POP.TOTL?display=default\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "indicator_name='population'\n",
    "data_df = pd.read_csv('../data/raw/country_population_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding life expectency data as taken from World Bank</h2>\n",
    "<p> Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. http://data.worldbank.org/indicator/SP.DYN.LE00.IN?display=default</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "indicator_name = 'life_expectency'\n",
    "data_df = pd.read_csv('../data/raw/country_lifeExpectency_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding education duration data as taken from World Bank</h2>\n",
    "<p>Duration of primary is the number of grades (years) in primary education.\n",
    "<br>\n",
    "http://data.worldbank.org/indicator/SE.PRM.DURS?display=default</p>\n",
    "\n",
    "<p>Duration of secondary education is the number of grades (years) in secondary education (ISCED 2 and 3). \n",
    "<br>\n",
    "http://data.worldbank.org/indicator/SE.SEC.DURS?display=default</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "indicator_name = 'years_of_primary_education'\n",
    "data_df = pd.read_csv('../data/raw/country_primary_education_duration_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'years_of_secondary_education'\n",
    "data_df = pd.read_csv('../data/raw/country_secondary_education_duration_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding Labor Force Participation data as taken from World Bank</h2>\n",
    "<p>Labor force participation rate, total (% of total population ages 15+) (national estimate)<br>\n",
    "http://data.worldbank.org/indicator/SL.TLF.CACT.NE.ZS?display=default</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "\n",
    "indicator_name = 'laborForce_patricipation'\n",
    "data_df = pd.read_csv('../data/raw/country_laborForceParticipation_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding country income category</h2>\n",
    "<p>From the metadata file that is downloaded with any tables from the world bank. The groups are:<br>['Upper middle income', 'Lower middle income', 'High income: OECD',\n",
    "       'Low income', 'High income: nonOECD']</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "indicator_name = 'IncomeGroup'\n",
    "data_df = pd.read_csv('../data/raw/country_income_category.csv',usecols=[indicator_name]+meta)\n",
    "data_df.columns = ['code']+[indicator_name]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding country Gross Savings % from GDP</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'gross_savings_percent'\n",
    "data_df = pd.read_csv('../data/raw/country_gross_savings_percent_of_GDP_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding country Inflation GDP deflator</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'inflation_GDP_deflator'\n",
    "data_df = pd.read_csv('../data/raw/country_inflation_GDP_deflator_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Adding country Inflation in Consumer Prices</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'inflation_consumer_prices'\n",
    "data_df = pd.read_csv('../data/raw/country_inflation_consumer_prices_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Industry Value Added</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'industry_value_added'\n",
    "data_df = pd.read_csv('../data/raw/country_industry_value_added_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Country Exports of Goods and Services</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'exports_of_goods_and_services'\n",
    "data_df = pd.read_csv('../data/raw/country_exports_of_goods_and_services_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Country School enrollment, tertiary gross % </h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'agriculture_value_added'\n",
    "data_df = pd.read_csv('../data/raw/country_agriculture_value_added_worldbank.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Country Agriculture, value added (% of GDP) </h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'tertiary_education'\n",
    "data_df = pd.read_csv('../data/raw/country_chool_enrollment_tertiary.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Country Total natural resources rents (% of GDP) </h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'natural_resources_rents'\n",
    "data_df = pd.read_csv('../data/raw/country_total_natural_resources_rents.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#in the csv file I changed ROU to ROM\n",
    "#years = ['1985', '1990', '1995', '2000', '2005']\n",
    "years = range(1985,2006)\n",
    "years = map(str, years) \n",
    "\n",
    "indicator_name = 'gini_ppp'\n",
    "data_df = pd.read_csv('../data/raw/country_GNI_PPP.csv',usecols=years+meta)\n",
    "data_df.columns = ['code']+[indicator_name+'_'+year for year in years]\n",
    "df = df.merge(data_df,left_on = 'code', right_on = 'code')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Saving the final file</h2>\n",
    "<p>I'll add description here</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_csv('../data/country_features.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Tall table (year as variable)</h2>\n",
    "<p>Prepearing for data imputation</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "new_data_frame= np.empty((0,7))\n",
    "for year in xrange(1985,2006):\n",
    "    b = np.array(\n",
    "            [df.code,\n",
    "             np.repeat(year,df.shape[0]).transpose(),\n",
    "             df['gdp_ppp_'+str(year)],\n",
    "             df['population_'+str(year)],\n",
    "            # df['laborForce_patricipation_'+str(year)],\n",
    "            df['years_of_secondary_education_'+str(year)],\n",
    "             df['years_of_primary_education_'+str(year)],\n",
    "             df['life_expectency_'+str(year)]\n",
    "            ]).transpose()\n",
    "    #print b.shape\n",
    "    #break\n",
    "    new_data_frame = np.concatenate((new_data_frame, b), axis=0) \n",
    "     "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_tall = pd.DataFrame(new_data_frame)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_tall.to_csv('tall_for_imputation.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
